Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(parser): Filter out URLs before sending to pelias/model #225

Merged
merged 1 commit into from
Jul 3, 2019

Conversation

orangejulius
Copy link
Member

@orangejulius orangejulius commented Jul 3, 2019

We have had numerous reports from Pelias users about concerning error message during builds regarding the URL regex filter from pelias/model#115.

While this filter is good, the resulting error message is alarming. Looking today at the output of a planet build, it appears that many of these errors come from the polylines file created by Valhalla out of the
OSM street network.

Looking at the contents of the polyline file and corresponding record on OSM, it seems that Valhalla puts the contents of the ref tag in the polyline file as an alternate name. The ref tag will
often contain a URL.

This means that not only will the error happen frequently, but many records that are actaully valid will be filtered out.

An example of this is the Iowa Women of Achievement bridge which is completely valid in terms of name, geometry, and tagging but contains a URL in the ref field.
Screenshot_2019-07-03_13-03-36

However the resulting line in the polylines file contains a URL as one name:

ejqinArl~pqDkA|T?bQ\pH\nIxB`QdE~SIowa Women of Achievement Bridgehttps://www.principal.com/riverwalk/iowa-woman-achievement-bridge.htm

The polylines importer currently selects a single name value from the list of names in the polylines file by choosing the longest, which will often be a URL.

This PR adds an additional filter that first removes any URL-like values from consideration, and should completely eliminate any of the otherwise concerning errors while ensuring all valid records make it into
Elasticsearch.

Fixes pelias/whosonfirst#456
Fixes #216
Fixes pelias/docker#89
Connects pelias/model#116

Copy link
Member

@missinglink missinglink left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

We have had numerous reports from Pelias users about concerning error
message during builds regarding the URL regex filter from
pelias/model#115.

While this filter is good, the resulting error message is alarming.
Looking today at the output of a planet build, it appears that many of
these errors come from the polylines file created by Valhalla out of the
OSM street network.

Looking at the contents of the polyline file and corresponding record on
OSM, it seems that Valhalla puts the contents of the `ref` tag in the
polyline file as an alternate name. The [ref tag](https://wiki.openstreetmap.org/wiki/Key:ref?uselang=en-US) will
often contain a URL.

This means that not only will the error happen frequently, but many
records that are actaully valid will be filtered out.

An example of this is the [Iowa Women of Achievement
bridge](ttps://www.openstreetmap.org/way/65066830) which is completely
valid in terms of name, geometry, and tagging but contains a URL in the
`ref` field.

The polylines importer currently selects a single name value from the
list of names in the polylines file by choosing the longest.

This PR adds an additional filter that first removes any URL-like values
from consideration, and should completely eliminate any of the otherwise
concerning errors while ensuring all valid records make it into
Elasticsearch.

Fixes pelias/whosonfirst#456
Fixes #216
Fixes pelias/docker#89
Connects pelias/model#116
@orangejulius
Copy link
Member Author

I added another small change to this PR that ensures only records with a valid name will be sent to later stages of the import pipeline and therefore have Pelias model documents created.

This was needed because it's now possible for a valid polyline row (already defined in the code as a row with a valid polyline encoded string and at least one name) to have zero valid names after URLs are filtered out. This should also cut down on possible errors like those in #3

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Error: invalid regex test invalid regex test invalid regex test
2 participants